Abstract. Despite having a large number of speakers, the Kurdish language
is among the less-resourced languages. In this work we highlight
the challenges and problems in providing the required tools and techniques
for processing texts written in Kurdish. From a high-level perspective,
the main challenges are: the inherent diversity of the language,
standardization and segmentation issues, and the lack of language
resources.

1 Introduction 

The Kurdish language belongs to the Indo-Iranian family of Indo-European
languages. Its closest better-known relative is Persian. Kurdish is spoken in
Kurdistan, a large geographical area spanning the intersections of Iran, Iraq,
Turkey, and Syria. It is one of the two official languages of Iraq and has
regional status in Iran.

Despite having 20 to 30 million speakers,¹ Kurdish is among the
less-resourced languages, and very few tailor-made tools are available for
processing texts written in this language. Similarly, it has not received much
attention from the IR and NLP research communities, and the existing literature
amounts to a few attempts at building a corpus and a lexicon for Kurdish [2,8].
In order to provide the basic tools and techniques for processing Kurdish, we
have recently launched a project at the University of Kurdistan (UoK).² This
paper gives an overview of the main challenges that we need to address
throughout this project.

Before enumerating the challenges, we would first like to highlight
a few points about the scope and methodology of the current paper. Firstly,
in this work we only consider the two largest and closely related branches of
Kurdish, namely Kurmanji (or Northern Kurdish) and Sorani (or Central Kurdish),
and exclude the other smaller and more distant dialects. Secondly, in the
interest of space, we give greater importance to the issues that are specific to
Kurdish and refer to the related papers for in-depth discussion of issues that
are shared between Kurdish and other languages (in particular, Persian, Arabic,
and Urdu). Finally, we restrict our discussion to the bag-of-words (i.e., IR)
model and do not address the structural (i.e., NLP) aspects.

1 Numbers vary, depending on the source.

2 Project's website: http://eng.uok.ac.ir/esmaili/research/klpp/en/main.htm 
2 Challenges

We have categorized the main challenges into five groups. While the first two 
groups are concerned with the diversity aspect of the Kurdish language, the 
third and fourth highlight the processing difficulties and the last one examines 
the depth of resource-scarcity for Kurdish. 

2.1 Dialect Diversity 

The first and foremost challenge in processing Kurdish texts is dialect
diversity. In this paper we focus on Kurmanji and Sorani, which are the two most
important Kurdish dialects in terms of number of speakers and degree of
standardization [3]. Together, they account for more than 75% of native Kurdish
speakers [5].

The features distinguishing these two dialects are mainly morphological (the
phonological differences are explained in the next section). The important
morphological differences are [5,3]:

— Kurmanji is more conservative in retaining both gender (feminine:masculine)
and case opposition (absolute:oblique) for nouns and pronouns. Sorani has
largely abandoned this system and uses the pronominal suffixes to take over
the functions of the case.³

— In past-tense transitive verbs, Kurmanji has full ergative alignment,
but Sorani, having lost the oblique pronouns, resorts to pronominal enclitics.

— In Sorani, passive and causative can be created exclusively via verb
morphology; in Kurmanji they can also be formed with the verbs hatin (to come)
and dan (to give), respectively.

— The definite suffix -eke appears only in Sorani.

2.2 Script Diversity 

For geopolitical reasons, each of the two aforementioned dialects uses
its own writing system. In fact, Kurdish is considered a bi-standard
language [2], with Kurmanji written in Latin-based letters and Sorani written in
Arabic-based letters. Both of these systems are almost totally phonetic [2]. As
noted before, Sorani and Kurmanji are not morphologically identical, and since
these systems reflect the phonology of their corresponding dialects, there is no
bijective mapping between them. In Figure 1 we have included three tables to
demonstrate the non-trivial mappings between these two writing systems. It
should be noted that table (c) contains approximate equivalences. These
mappings are in line with the list proposed in [3].
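
The absence of a bijective mapping can be illustrated with a minimal Python
sketch. The letter correspondences below are a small illustrative sample that
we assume for the example (they are not a transcription of the tables in
Figure 1), and the names ARABIC_TO_LATIN, invert, is_bijective and candidates
are hypothetical helpers introduced only for this sketch.

```python
# Assumed sample of correspondences between Arabic-based (Sorani) letters
# and Latin-based (Kurmanji) letters. Illustrative only, not exhaustive.
ARABIC_TO_LATIN = {
    "\u0648": ["w", "u"],  # one Arabic-based letter, two Latin-based readings
    "\u06cc": ["y", "\u00ee"],  # likewise: consonant y or vowel i-circumflex
    "\u0695": ["r"],       # reh with small v below: approximate equivalent
    "\u0631": ["r"],       # plain reh
    "\u06b5": ["l"],       # lam with small v: approximate equivalent
    "\u0644": ["l"],       # plain lam
    "\u0628": ["b"],       # a trivial one-to-one case
}


def invert(mapping):
    """Build the reverse table: Latin-based letter -> Arabic-based letters."""
    inverse = {}
    for source, targets in mapping.items():
        for target in targets:
            inverse.setdefault(target, []).append(source)
    return inverse


def is_bijective(mapping):
    """True only if every letter has exactly one image in both directions."""
    forward_ok = all(len(targets) == 1 for targets in mapping.values())
    backward_ok = all(len(sources) == 1 for sources in invert(mapping).values())
    return forward_ok and backward_ok


def candidates(word, mapping):
    """Enumerate every transliteration the table allows for a word."""
    results = [""]
    for ch in word:
        options = mapping.get(ch, [ch])  # unknown characters pass through
        results = [prefix + option for prefix in results for option in options]
    return results
```

Under these assumptions is_bijective(ARABIC_TO_LATIN) is false in both
directions, so a converter has to choose among several candidates using
context rather than a simple lookup table.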

2.3 Normalization 

The Unicode assignments of the Arabic-based Kurdish alphabet have two
potential sources of ambiguity that should be handled carefully:

3 Although there is evidence of gender distinctions weakening in some varieties of 
Kurmanji [3]. 
a) From Latin-based to Arabic-based b) From Arabic-based to Latin-based c) From Arabic-based to Latin-based (approx.)

Fig. 1. Non-Trivial Mappings between Arabic-based and Latin-based Kurdish Alphabets. 



— For some letters, such as ye and ka, there is more than one Unicode code
point [7]. During the normalization phase, the occurrences of these multi-code
letters should be unified.

— As in Urdu, the isolated and final forms of the Arabic letter ha constitute
one letter (pronounced e), whereas the initial and medial forms of the same
Arabic letter constitute another letter (pronounced h), for which a different
Unicode code point is available [8,2]. In many electronic texts, both letters
are written using only ha and are differentiated by the zero-width non-joiner
character, which prevents a character from being joined to the one that follows
it. This distinction must be taken into account in the normalization phase.
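
The two normalization steps above can be sketched in Python as follows. This is
a minimal sketch under stated assumptions, not the project's implementation:
the choice of U+06CC and U+06A9 as the unified forms of ye and ka, the use of
U+06D5 for the vowel e, and the rule that a word-final ha stands for e are all
assumptions made for the example, and normalize is a hypothetical name.

```python
import re

ZWNJ = "\u200c"  # ZERO WIDTH NON-JOINER
HEH = "\u0647"   # ARABIC LETTER HEH (the letter called ha in the text)
AE = "\u06d5"    # ARABIC LETTER AE, assumed here as the code point for e

# Multi-code letters unified to a single code point (assumed targets).
UNIFY = str.maketrans({
    "\u064a": "\u06cc",  # ARABIC LETTER YEH -> ARABIC LETTER FARSI YEH
    "\u0643": "\u06a9",  # ARABIC LETTER KAF -> ARABIC LETTER KEHEH
})


def normalize(text: str) -> str:
    """Unify letter variants and make the e/h distinction explicit."""
    text = text.translate(UNIFY)
    # ha followed by a zero-width non-joiner is shown in its final or
    # isolated form, so it is taken to be the vowel e.
    text = text.replace(HEH + ZWNJ, AE)
    # ha with no word character after it is also in final form (heuristic).
    text = re.sub(HEH + r"(?!\w)", AE, text)
    return text
```

Any remaining ha is followed by another letter, i.e. it is in initial or medial
form, and is therefore left as the consonant h.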

2.4 Segmentation and Tokenization 

Segmentation refers to the process of recognizing boundaries of text constituents, 
including sentences, phrases and words. Compared to Persian and Arabic, this 
process is relatively easier in Kurdish, mainly because short vowels are explicitly 
represented in the Kurdish writing systems. 

In fact, as discussed in [1], the absence of short vowels contributes most
significantly to ambiguity in Arabic, causing difficulty in homograph
resolution, word sense disambiguation, and part-of-speech detection. In Persian,
its negative consequence is more visible in detecting the Izafe constructs [7].⁴

Despite incorporating short vowels, the Arabic-based Kurdish alphabet still 
suffers from two problems which are inherited from the Arabic writing system: 

— The Arabic alphabet has no capitalization, which makes it more difficult
to recognize sentence boundaries and named entities.

— Space is not a deterministic delimiter and boundary sign [7]. It may appear
within a word or between words, or may be absent between some sequential 
words. There are some proposals on how to tackle this issue in Persian [3] 
and Urdu 0. 

4 Izafe is an unstressed vowel -e or -i added between prepositions, nouns and
adjectives in a phrase. It approximately corresponds to the English preposition
of. This construct is frequently used in both Persian and Kurdish.
2.5 Lack of Language Resources

Kurdish is a resource-scarce language for which the only linguistic resource avail- 
able on the Web is raw text [5]. 

More concretely, in spite of the few attempts at building a corpus [2] and a
lexicon [8], Kurdish still does not have any large-scale, reliable general or
domain-specific corpus. Furthermore, no test collection (which is essential for
evaluating Information Retrieval systems) or stemming algorithm has been
developed for Kurdish so far.

Lastly, although Kurdish is well served with dictionaries [3], it still lacks a 
WordNet-like semantic lexicon. 

3 Conclusions 

Kurdish text processing poses a range of challenges. The most important one is
the dialect/script diversity, which has resulted in a bi-standard situation. As
the examples in [2] show, the "same" word, when going from Kurmanji to Sorani,
may at the same time go through several levels of change: writing system,
phonology, morphology, and sometimes semantics. This clearly shows that the
mapping between these two dialects is more than transliteration, though less
complicated than translation. Any text processing system designed for the
Kurdish language should develop and exploit a mapping between these two
standards.

On the technical side, providing the required processing techniques, whether by
leveraging existing techniques or by designing new ones where needed, offers
many avenues for future work. However, as a critical prerequisite to most of
these tasks, a core set of language resources must be available first. At UoK,
we have taken the first step and are currently working on building a large
standard test collection for the Kurdish language.